Abstract: We investigate the inflection structure of a synthetic
language using Latin as an example. We construct a bipartite
graph in which one group of vertices corresponds to dictionary
headwords and the other group to inflected forms encountered in
a given text. Each inflected form is connected to its corresponding
headword, which in some cases is non-unique. The resulting
sparse graph decomposes into a large number of connected
components, to be called word groups. We then show how the
concept of the word group can be used to construct coverage
curves of selected Latin texts. We also investigate a version of the
inflection graph in which all theoretically possible inflected forms
are included. The distribution of sizes of connected components of
this graph resembles the cluster distribution in lattice percolation
near the critical point.

I. Introduction 

The vocabulary of human languages can be viewed as a large
and complex network or graph, in which individual vertices
represent words or families of words, and edges represent
relationships between words. Many such models have been
studied in recent years, including networks of co-occurrences
of words in sentences [1], thesaurus graphs [2], [3], [4],
WordNet database graphs [5], and many others [6], [7], [8],
[9], [10].

Much of the aforementioned work has been done in the
context of the English language, which, among other
characteristic properties, exhibits only minimal inflection, especially
when compared to other Indo-European languages. In analytic
languages like English, grammatical categories and relations
are handled mostly by word order, and not by inflection.

In contrast, synthetic languages such as Latin, Greek,
Polish, or Russian make extensive use of inflection, and
one word in these languages can appear in a great many forms,
reflecting grammatical categories such as tense, mood, person,
number, gender, case, etc. Relatively little work has been done
so far on modelling inflected languages using the
paradigm of complex networks, and the goal of this paper is
to present some initial findings of the author in this area.

Given the abundance of synthetic languages, one faces the 
issue of selecting one of them for detailed analysis. The 
language which has been most heavily studied and for which 
the largest body of literature exists is a natural choice - and 
there is no doubt that Latin must be chosen given these criteria. 
Fig. 1. Visualization of the inflection graph of De bello Gallico.



Latin literature spans over 20 centuries, ranging from
the literature of ancient Rome right up to the 21st century.
Latin served as a lingua franca for Western civilization for many
centuries, so there is no shortage of Latin texts, and a vast
number of them are available in electronic form.

In addition to this, the Latin inflectional system has been studied
so extensively that virtually every aspect of it
is exceptionally well documented. Software tools which
"understand" the Latin inflection system are also readily available,
including an excellent open-source WORDS program written
by W. Whitaker [11].

II. Motivation

One of the motivations for this work was the problem of
vocabulary size. Let us suppose that we want to count how
many distinct words a given work contains - for example, for
the purpose of comparing two works and deciding which one
is more "difficult" as far as vocabulary is concerned. How
do we do this in a language like Latin, where one dictionary
headword can have hundreds of different forms?
In addition to this, in some cases, one inflectional form can
correspond to more than one dictionary headword, and one
must deduce from the context which one to choose.

The earliest qualitative approach to Latin vocabulary can be 
found in the Ph.D. thesis of Paul Bernard Diederich [12] from 
Fig. 2. Example of a connected component of the inflection graph of De bello
Gallico. Vertices with capitalized labels correspond to dictionary headwords,
and those in lowercase to inflected forms.



1939, who performed a count of Latin headwords occurring
in a selection of texts from 200 Latin authors totaling over
200,000 words. He did this entirely by hand - hardly an
attractive proposition in today's computerized world.

If one wants to perform a computerized count of words,
and wants to count various inflectional forms of the same
headword as one entry, one has to understand the relationship
between inflected forms and headwords. This can be done by
introducing the concept of an inflection graph.

III. Inflection graph 

The inflection graph for a given text is a bipartite graph
which is constructed as follows. First, we create a set of
vertices, to be denoted by B, corresponding to all distinct
words of the text. If a word occurs in the text more than once,
it is nevertheless represented by one vertex. We then go over
all vertices in B, and check which headwords can possibly
correspond to each of the words in B. These headwords form
another set of vertices, to be called A. In practice, headwords
may be obtained by using W. Whitaker's WORDS program.
As in the case of B, elements of A are unique, so that each
headword appears in A only once.

If an element of A is a headword corresponding to some
element of B, then these two are connected. Obviously, for
most headwords in A, there are many corresponding inflected
forms in B, so an element of A is typically connected to many
elements of B. For example, dicunt (they say) and dixit (he
said) are both inflected forms of the verb dico, thus we will
have a vertex in A corresponding to dico connected to vertices
in B corresponding to dicunt and dixit.

Fig. 3. Degree distribution of headword vertices.

However, the opposite can also be true: in some instances, a
word can be an inflected form of more than one headword, so
that elements of B are sometimes connected to more than one
element of A. As an example, consider the word sublatus,
which could be a form of tollo (lift, raise) or suffero (bear,
endure), thus a vertex of B corresponding to sublatus will be
connected to vertices of A corresponding to tollo and suffero.
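
The construction described above can be illustrated with a minimal sketch. It assumes a hypothetical, hard-coded lookup table `LOOKUP` in place of the output of the WORDS program, and the function name `build_inflection_graph` is introduced here only for illustration; it is not the script used to produce the results reported in this paper.

```python
# Hypothetical lookup table: inflected form -> possible headwords.
# In the paper this information comes from Whitaker's WORDS program;
# here it is hard-coded for illustration only.
LOOKUP = {
    "dicunt": ["dico"],
    "dixit": ["dico"],
    "sublatus": ["tollo", "suffero"],
}


def build_inflection_graph(text_words, lookup):
    """Return (A, B, edges) for the bipartite inflection graph."""
    B = set(text_words)          # each distinct form becomes one vertex
    A = set()                    # headwords, each stored only once
    edges = set()
    for form in B:
        for headword in lookup.get(form, []):
            A.add(headword)
            edges.add((headword, form))
    return A, B, edges


words = ["dixit", "dicunt", "sublatus", "dixit"]
A, B, edges = build_inflection_graph(words, LOOKUP)

# Degree of a headword = number of its inflected forms seen in the text.
degree = {h: sum(1 for (a, _) in edges if a == h) for h in A}
```

In this toy example the repeated word dixit yields a single vertex in B, and sublatus contributes two edges, one to each of its candidate headwords.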

The bigraph obtained using the aforementioned procedure
is typically quite large but not very dense. For example, for
the classic work of Julius Caesar De bello Gallico (published
in the 50s or 40s BC), consisting of 51,300 words, this bigraph
has 5,377 vertices in A, 10,977 vertices in B, and 15,349
edges. Figure 1 shows a visualization of this graph done by
Walrus [13], a software tool for visualizing large graphs using
3D hyperbolic geometry and a fisheye-like distortion. The degree
distribution for vertices of type A (headwords) for De bello
Gallico, as well as some other works (to be discussed later), is
shown in Figure 3. The distribution seems to have features of
a power law. We also observe that the degree of a headword
represents the number of inflected forms of that headword
appearing in the text - and, as one can see from the figure,
this number can easily approach 100.

A. Connected components 

An important feature of the inflection graph is that it is
not connected, and that it has a large number of
disjoint components - in the case of De bello Gallico, 3,740
components. Each of these components will be called a word
group. An example of such a component is shown in Figure 2.
It consists of four headwords (written in uppercase) and 18
inflected forms (written in lowercase).
Fig. 4. Rank-frequency distribution of group occurrences in Vulgate (+) and
De bello Gallico (x).

Fig. 5. Coverage curves for words.



In most cases, words within one group are closely related
semantically, but not always - occasionally words with quite
different meanings may belong to the same group, as in the
case of tollo and suffero mentioned above. This is especially
true for the largest group, as we will see later on. Nevertheless,
word groups usually closely correspond to what linguists call
word families. The main advantage of word groups lies in
the fact that they can be easily determined using well-known
algorithms for computing connected components of a graph.
We used the Python package NetworkX [14] for this purpose.
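
A minimal sketch of this step with NetworkX is shown below. The edge list is a toy example (the form tollit is added here purely for illustration), and headwords are written in uppercase, as in Fig. 2, so that a headword and an identically spelled inflected form would not collapse into one vertex. This is an assumption of the sketch, not necessarily the convention of the original script.

```python
import networkx as nx

# Toy edge list (headword, inflected form); headwords are upper-cased
# to keep the two vertex types distinct.
edges = [
    ("DICO", "dicunt"),
    ("DICO", "dixit"),
    ("TOLLO", "sublatus"),
    ("SUFFERO", "sublatus"),
    ("TOLLO", "tollit"),
]

G = nx.Graph()
G.add_edges_from(edges)

# Each connected component of the inflection graph is one word group.
word_groups = [set(component) for component in nx.connected_components(G)]

# Map every vertex to the index of its group, for later counting.
group_of = {v: i for i, comp in enumerate(word_groups) for v in comp}
```

Here TOLLO and SUFFERO end up in the same word group because they share the form sublatus, while DICO forms a separate group.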

B. Rank-frequency distribution for groups 

Having the concept of the word group defined, we can
now label all groups with distinct labels, for example, with
consecutive integers i. If a given word from the text belongs
to a group labelled i, we will say that it is an occurrence of the
group i. Obviously, some groups occur more often than others,
so we can sort all groups in decreasing order of occurrences in
the text. The position of a group on this list will be denoted by r
(rank), and the number of occurrences of that group in the text
will be denoted by $n_g(r)$. The analogous rank-frequency function for
individual words will be denoted by $n_w(r)$. Figure 4 shows
log-log plots of $n_g(r)$ versus r for two very different works,
namely the aforementioned De bello Gallico and the Latin
Bible translation of St. Jerome known as Vulgate (AD 390 to
405). In both cases non-Zipfian behavior is very clear, that is,
the resulting curves are not straight lines.
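
The two rank-frequency functions can be computed as in the following sketch. The mapping `group_of` and the token list are made up for illustration, and the helper name `rank_frequency` is hypothetical.

```python
from collections import Counter

# Hypothetical mapping from inflected form to group label i.
group_of = {"dixit": 0, "dicunt": 0, "sublatus": 1, "tollit": 1, "et": 2}

# Made-up sequence of running words of a text.
tokens = ["et", "dixit", "et", "dicunt", "sublatus", "et", "dixit"]


def rank_frequency(items):
    """Return occurrence counts in decreasing order; index r-1 holds n(r)."""
    return sorted(Counter(items).values(), reverse=True)


n_w = rank_frequency(tokens)                        # words
n_g = rank_frequency(group_of[t] for t in tokens)   # groups
```

Both lists sum to the number of running words, but the group list is shorter, since several words are merged into one group.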

C. Coverage curves 

A helpful concept used to describe statistical properties of
texts is the so-called coverage curve. If we assume that the
reader of the text knows the k top-ranking words, then the text
coverage, or the fraction of known words in the text, is defined
as

$$C_w(k) = \frac{\sum_{r=1}^{k} n_w(r)}{\sum_{r=1}^{N} n_w(r)}, \qquad (1)$$

where N is the total number of distinct words in the text.
The graph of $C_w(k)$ versus k is known as the coverage curve.
Analogously, for groups we can define

$$C_g(k) = \frac{\sum_{r=1}^{k} n_g(r)}{\sum_{r=1}^{M} n_g(r)}, \qquad (2)$$

where M is the total number of word groups in the text.
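
As an illustration, coverage as a ratio of a partial sum to the total sum of a rank-frequency list can be computed as follows. The counts are made up, and the helper names `coverage_curve` and `items_needed` are hypothetical.

```python
from itertools import accumulate


def coverage_curve(n):
    """Given n(1), n(2), ... in decreasing order, return [C(1), ..., C(len(n))]."""
    total = sum(n)
    return [partial / total for partial in accumulate(n)]


def items_needed(n, target):
    """Smallest k such that C(k) >= target."""
    for k, c in enumerate(coverage_curve(n), start=1):
        if c >= target:
            return k
    return len(n)


n_w = [50, 20, 10, 10, 5, 3, 1, 1]   # made-up counts, total = 100
C_w = coverage_curve(n_w)
k95 = items_needed(n_w, 0.95)
```

The same two functions apply unchanged to a list of group counts, which gives the group coverage.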

In order to illustrate the difference between $C_w(k)$ and
$C_g(k)$, we constructed these coverage curves for five different
texts. These texts, in addition to the already mentioned De bello
Gallico and Vulgate, include Cicero's Philippicae (written 44-43
BC) and the collection of medieval stories Gesta Romanorum
(13th-14th century). We also wanted to include a
contemporary text of considerable length, which proved to be
difficult due to the scarcity of such texts. In the end we somewhat
artificially produced a text by combining two shorter
documents. The resulting file is titled Encyclicals and consists of
two encyclicals of John Paul II, Ut Unum Sint and Evangelium
Vitae. Both of these were issued in the same year (1995),
thus they are sufficiently similar in style to consider them as
parts of one document. The texts of the encyclicals were obtained
from the Vatican repository [15], and the remaining texts from
The Latin Library [16]. In order to make sure that differences
in text size do not interfere with our analysis, all texts have
been truncated so that they have the same length as De bello
Gallico, that is, 51,300 words. The coverage curves were
obtained by the following procedure:

• The text was converted to lowercase, and all punctuation
marks and digits were removed.

• A list of words was created, together with the number of
occurrences of each word.

• For each word in the above list, we used Whitaker's
WORDS program to find corresponding headwords.

• The inflection graph was constructed, and its connected
components determined using a Python/NetworkX script.

• The frequency of occurrence of each word group was
computed, and the coverage curve plotted.
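
The first two steps of the above procedure can be sketched as follows. This is an illustrative sketch only: it assumes plain ASCII Latin text (no macrons or ligatures), and the function names `normalize` and `word_counts` are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and replace punctuation marks and digits by spaces."""
    text = text.lower()
    # Assumption: ASCII-only input, so anything outside a-z is dropped.
    return re.sub(r"[^a-z\s]", " ", text)


def word_counts(text):
    """Return a Counter mapping each distinct word to its number of occurrences."""
    return Counter(normalize(text).split())


sample = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae."
counts = word_counts(sample)
```

The keys of `counts` form the set B of the inflection graph, and the values are the word occurrence counts needed later for the rank-frequency functions.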
Fig. 6. Coverage curves for groups.



Figures 5 and 6 show coverage curves for words and groups
for all five sample texts. Comparing these figures one can
immediately notice two things. First of all, the coverage
converges to 100% faster or slower, depending on the text. For
word coverage, Vulgate and Gesta Romanorum are clearly
converging faster than both the classical Latin texts of Caesar
and Cicero and the contemporary encyclicals. This agrees with
the general consensus of Latinists, who consider medieval texts
"easier" than classical ones.

Another observation is the difference between word
coverage and group coverage. A much smaller number of groups than
individual words is needed to achieve the same coverage, as
one would naturally expect. Table I lists the number of words
and groups required to obtain 95% and 98% coverage in all
five sample texts. These two numbers have been used because
it is often argued that in order to read a given text with minimal
distraction one needs to know enough words to cover at least
95%-98% of the running words of the text. Such high coverage is
also needed to transfer reading skills from one's mother tongue
to another (foreign or second) language. It is encouraging for
students of Latin that the number of word groups required to
obtain such coverage is relatively low. Nevertheless, one has to
be careful when interpreting this table, remembering that one group
may consist of many dictionary headwords.

Table I also reveals some interesting differences between
classical texts written in literary and vulgar Latin. In order
to reach 95% coverage of Philippicae one needs a much larger
number of words than in the case of Vulgate. On the other
hand, Vulgate requires a larger number of word groups to
achieve the same coverage. This shows that Cicero used the
inflection system of the Latin language much more skillfully
- he used fewer word groups than St. Jerome, yet from these
he obtained a larger number of inflected forms!

D. Normalized coverage 

In order to make coverage curves independent of N and M,
one can define the normalized coverage as

$$c_w(x) = C_w(Nx), \qquad (3)$$

$$c_g(x) = C_g(Mx), \qquad (4)$$

where $x \in [0,1]$. If the exact form of $n_w(r)$ or $n_g(r)$ was
known, the corresponding coverage could be computed. For
example, in the case of the Zipf law, $n_w(r) = A/r$, where A
is a normalization constant. In such a case, because
$\sum_{r=1}^{n} 1/r = \psi(n+1) + \gamma$, the sums required in
eq. (1) can be computed in closed form, yielding

$$c_w(x) = \frac{\psi(Nx+1) + \gamma}{\psi(N+1) + \gamma}, \qquad (5)$$

where $\psi$ is the digamma function, and $\gamma = 0.57721566\ldots$ is
the Euler-Mascheroni constant.

TABLE I

Table of number of words (w) and groups (g) required to
obtain 95% and 98% coverage.

Fig. 7. Graph of $c_g(x) - x$ as a function of x for Encyclicals. Black dots
represent data (only every 60th point is plotted), and the continuous curve is the line
of the best fit given by eq. (7).

Unfortunately, in the case of the group coverage, we do
not know the form of $n_g(r)$, thus a similar formula
cannot be produced. It is possible, however, to obtain an empirical
fit with a small number of parameters. This can be readily
understood if we plot $c_g(x) - x$ as a function of x, as shown
in Figure 7 where, in order to avoid clutter, only coverage
data for one data set (Encyclicals) is presented. Symmetry of
this figure suggests that $c_g(x) - x$ may be approximated by
the graph of $x^{\alpha}(1-x)^{\beta}$, where $\alpha$, $\beta$ are parameters of the fit.

We therefore fitted the function

$$c_g(x) = x + x^{\alpha}(1-x)^{\beta} \qquad (6)$$

to the normalized coverage data $c_g(x)$. The fit, although good,
was less than ideal, so we introduced another two parameters,
arriving at eq. (7).



TABLE II 
Fit parameters 
It should be noted that this choice of the equation does not
have any particular meaning; it has simply been observed that it
produces a good fit, thus it is a convenient way to describe
coverage curves. A sample fit produced using this equation is
shown in Figure 7 (continuous line). Values of parameters for
all five sample texts are shown in Table II. From the form
of eq. (7) one can see that the smaller of the parameters $\alpha$ and $\gamma$
controls the initial steepness of the normalized coverage curve,
and therefore we will define

$$\eta = \min\{\alpha, \gamma\}. \qquad (8)$$



The value of $\eta$ can be given an interpretation related to the
nature of the underlying text. If $\eta$ is large, it means that the
normalized coverage curve is growing slowly - that is, a high
percentage of all word groups present in the text is needed to
achieve, say, 95% text coverage. On the other hand, if $\eta$ is
small, this means a steep coverage curve, so that high coverage
is reached quickly.

For that reason, one can say that the parameter $\eta$ tells us to
what degree the inflection mechanism of the language is
used in order to provide high text coverage. Small $\eta$ means
high reliance on the inflection mechanism. From Table II we
can therefore conclude that Vulgate and Gesta Romanorum do
not rely on inflection as much as the classical works or
modern encyclicals.

IV. Inflection graph for a dictionary 

In previous sections, we considered inflection graphs
for individual literary works. It is possible to obtain such a graph
for the whole Latin language - or, to be more precise, for all
words from a large dictionary. In Whitaker's WORDS
dictionary, there are 35,670 distinct headwords. Using the WORDS
program, one can construct all possible inflected forms for all
these headwords, resulting in a list of 1,032,669 word forms.
We should note that those are all theoretically possible forms,
and that not all of them are attested in the surviving corpus
of Latin texts. For example, the form coquor is a theoretically
possible passive, present tense, first person singular of coquo
(to cook), yet it does not seem to be attested in Latin
texts [17].

Fig. 8. Distribution of sizes of headword groups in all Latin words from
Whitaker's WORDS dictionary.

If we connect all headwords from the dictionary with all
theoretically possible inflected forms, we obtain an inflection
graph of the whole language. The resulting graph is very
sparse, having 1,117,394 edges, that is, only about 10% more than
the number of inflected forms. The reason for this is that the vast
majority of inflected forms correspond to only one dictionary
headword. The number of connected components of this graph
is 50,847, and their size distribution appears to follow a power
law. This is illustrated in Figure 8. If $H(m)$ is the number of
groups with m headwords, then the line of the best fit shown
in Figure 8 closely follows the power law

$$H(m) \sim m^{-\tau}, \qquad (9)$$

where the value of the exponent $\tau$, obtained by fitting a straight
line to the data points excluding several largest clusters, is
$\tau = 3.32$.
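
The fitting of a straight line on a log-log plot can be sketched as below. The data are synthetic, generated from an exact power law with exponent 3, and are not the data of Figure 8; the function name `fit_power_law` is hypothetical, and ordinary least squares on the logarithms is an assumption about how such a fit may be done.

```python
import math


def fit_power_law(ms, hs):
    """Least-squares line through (log m, log H); returns (tau, log_prefactor)."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(h) for h in hs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    # H(m) ~ m**(-tau) means the slope on the log-log plot equals -tau.
    return -slope, my - slope * mx


# Synthetic data generated from H(m) = 1000 * m**-3; NOT the paper's data.
ms = list(range(1, 11))
hs = [1000.0 * m ** -3 for m in ms]
tau, log_c = fit_power_law(ms, hs)
```

With real data, the largest clusters would be dropped from `ms` and `hs` before the fit, as described above.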

This phenomenon strongly resembles percolation. The scaling
theory for percolation predicts that sizes of connected components
exhibit close to power-law behavior near the percolation
threshold. This seems to suggest that the sparse inflection
graph discussed here may be close to its percolation threshold.

Let us now make some remarks regarding the largest
components, corresponding to data points lying above the fitted line
on the right of Figure 8. This deviation could be caused by the
finite size of the graph. Similar behavior is often observed in
numerical simulations of percolation in finite systems.

Obviously, the headwords belonging to very large clusters
cannot all be semantically related, and the fact that they are
grouped together may be partly due to the inclusion
of all theoretically possible inflected forms. A path joining
two vertices may exist solely because it passes through some
inflected form which is unattested. In Figure 9, the largest
cluster of the dictionary inflection graph is shown. One can see
from this picture that the cluster is composed of several
one-level trees (stars) loosely connected via a number of bridges.
Some of these bridges are likely "artificial" (unattested) forms,
and removing them would most likely divide the big cluster
into a number of smaller clusters.




Fig. 9. Visualization of the largest component of the dictionary inflection 
graph. 



V. Conclusion 

We presented some preliminary findings regarding
properties of inflection graphs. The concept of the word group,
defined as a connected component of the inflection graph,
appears to be useful in describing the vocabulary structure of
a text. The parameter $\eta$ could be used to characterize some
aspects of text difficulty, describing the balance between
the diversity of vocabulary and the diversity of inflected
forms. Obviously, if one wants to categorize texts according
to their perceived difficulty, vocabulary is not the only factor.
Structure of sentences and word order are equally important,
or sometimes even more important, thus such a categorization
scheme will most likely involve several parameters. The author
hopes that an automated system classifying Latin texts will
eventually be constructed, since such a system would be very beneficial
to students of Latin. In languages such as English, sets of
graduated books exist, and are routinely used in classrooms.
Students start from very simple texts and seamlessly progress
to texts of increasing difficulty. In Latin it is very unlikely
that such a set of books would ever be written, given the current
status of the language. Nevertheless, texts of various levels of
difficulty do exist in Latin - yet it is often hard for beginners to
determine beforehand which texts (out of thousands available)
should be read first, and which should be left till later. We
plan to develop the ideas presented here to create an automated
system which could be used to "measure" text difficulty and
eventually mine internet repositories to produce a list of Latin
texts forming a sequence with gradually increasing difficulty.

The similarity of the dictionary inflection graph to a percolating
network is also currently being investigated. We are collecting a
large corpus of Latin texts in order to remove from the graph
all unattested words. Due to variation of spelling, especially in
Medieval texts, this task cannot be fully automated, thus final
results are not yet available. We also plan to find out whether
other synthetic languages exhibit a similar scaling phenomenon
to that of the Latin inflection graph.